In this project we work with the assigned 20 Newsgroups dataset. We first preprocess it by cleaning the text and building a vocabulary, then visualize a set of summary statistics for the preprocessed data. Next, we train an LDA model on the dataset and build a vector representation of the documents by training a Doc2Vec model. Finally, we compare and contrast the visualizations produced at each of these stages.
| category | docCount | sentCount | wordCount | numUniqueWords | meanSentLength | minSentLength | maxSentLength | stdSentLength |
|---|---|---|---|---|---|---|---|---|
| comp.windows.x | 593 | 9166 | 75989 | 4992 | 8.290312 | 1 | 204 | 8.721868 |
| comp.os.ms-windows.misc | 591 | 10398 | 119751 | 4505 | 11.516734 | 1 | 1170 | 43.798733 |
| talk.politics.misc | 465 | 11603 | 91194 | 7575 | 7.859519 | 1 | 117 | 7.192296 |
| comp.sys.ibm.pc.hardware | 590 | 6919 | 50143 | 4485 | 7.247146 | 1 | 185 | 7.490722 |
| talk.religion.misc | 377 | 7843 | 57363 | 6738 | 7.313911 | 1 | 174 | 6.795975 |
| rec.autos | 594 | 7645 | 58613 | 5815 | 7.666841 | 1 | 117 | 6.981358 |
| sci.space | 593 | 9366 | 86337 | 7516 | 9.218129 | 1 | 606 | 11.898430 |
| talk.politics.guns | 546 | 11615 | 90137 | 7363 | 7.760396 | 1 | 130 | 7.111474 |
| alt.atheism | 480 | 10094 | 70099 | 6785 | 6.944621 | 1 | 75 | 5.678931 |
| misc.forsale | 585 | 5031 | 40633 | 5295 | 8.076526 | 1 | 321 | 13.065767 |
| comp.graphics | 584 | 7574 | 59159 | 5529 | 7.810800 | 1 | 177 | 7.774651 |
| sci.electronics | 591 | 7299 | 55248 | 5510 | 7.569256 | 1 | 142 | 6.635209 |
| sci.crypt | 595 | 12105 | 105026 | 7202 | 8.676250 | 1 | 279 | 8.544165 |
| soc.religion.christian | 599 | 12869 | 93529 | 7714 | 7.267775 | 1 | 83 | 5.540344 |
| rec.sport.hockey | 600 | 9879 | 72776 | 5561 | 7.366738 | 1 | 527 | 11.625689 |
| sci.med | 594 | 10108 | 82824 | 8107 | 8.193906 | 1 | 412 | 8.715710 |
| rec.motorcycles | 598 | 7393 | 53783 | 6073 | 7.274855 | 1 | 85 | 6.663944 |
| comp.sys.mac.hardware | 578 | 6368 | 45276 | 4441 | 7.109925 | 1 | 88 | 6.234718 |
| talk.politics.mideast | 564 | 16341 | 122177 | 8161 | 7.476715 | 1 | 284 | 6.935771 |
| rec.sport.baseball | 597 | 8268 | 55240 | 4931 | 6.681180 | 1 | 158 | 7.167797 |
| total | 11314 | 181160 | 1357526 | 23515 | 7.493519 | 1 | 1159 | 12.828900 |
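
The columns above can be read as simple aggregates over tokenized sentences; for example, `meanSentLength` matches `wordCount / sentCount` (75989 / 9166 ≈ 8.29 for `comp.windows.x`). The following is a minimal Python sketch of how such a row might be computed. The helper names (`split_sentences`, `tokenize`, `category_stats`), the regex-based sentence splitting, the alphabetic-only cleaning rule, and the use of the population standard deviation are all assumptions made for illustration, not the pipeline actually used to produce the table.

```python
import re
import statistics


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Hypothetical cleaning rule: lowercase, keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def category_stats(docs):
    """Compute one table row from a list of raw document strings."""
    sent_lengths = []  # number of tokens in each non-empty sentence
    vocab = set()      # distinct tokens seen in this category
    for doc in docs:
        for sent in split_sentences(doc):
            tokens = tokenize(sent)
            if tokens:  # drop sentences that are empty after cleaning
                sent_lengths.append(len(tokens))
                vocab.update(tokens)

    stats = {
        "docCount": len(docs),
        "sentCount": len(sent_lengths),
        "wordCount": sum(sent_lengths),
        "numUniqueWords": len(vocab),
    }
    if sent_lengths:
        stats["meanSentLength"] = statistics.mean(sent_lengths)
        stats["minSentLength"] = min(sent_lengths)
        stats["maxSentLength"] = max(sent_lengths)
        # Population standard deviation is an assumption; the report
        # does not say which estimator was used.
        stats["stdSentLength"] = statistics.pstdev(sent_lengths)
    return stats


if __name__ == "__main__":
    sample = ["The cat sat. The dog ran fast!", "Hello world."]
    print(category_stats(sample))
```

Because every sentence that survives cleaning has at least one token, `minSentLength` is 1 or more by construction, which agrees with the constant 1 in that column.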